{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/analytics-and-ml/model-training/gpl/00-download-cord-19.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/analytics-and-ml/model-training/gpl/00-download-cord-19.ipynb)\n",
    "\n",
    "We start by downloading the CORD-19 dataset from Allen Institute for AI, either manually from [here](https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases.html) or by running the code block below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "\n",
    "url = \"https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_2022-02-07.tar.gz\"\n",
    "\n",
    "with requests.get(url, stream=True) as r:\n",
    "    with open('cord-19.tar.gz', 'wb') as fp:\n",
    "        for chunk in r.iter_content(chunk_size=1024):\n",
    "            fp.write(chunk)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Both methods create a *tar.gz* file, we need to extract the files stored within here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tarfile\n",
    "\n",
    "with tarfile.open('cord-19.tar.gz', 'r:gz') as tar:\n",
    "    tar.extractall()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This extracts a new directory named as the date of the dataset, in our case that is *2022-02-07*. In this directory we have a another *tar.gz* file called *document_parses.tar.gz*, which contains the data we will be using. Again, we extract this with `tarfile` again."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "with tarfile.open('2022-02-07/document_parses.tar.gz') as tar:\n",
    "    tar.extractall()"
   ]
  },
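  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the full archive is too large for the available disk, one option is to extract only part of it instead of running the cell above. The following is a minimal sketch, not part of the original workflow: `extract_subset` is a hypothetical helper, and the `document_parses/pdf_json/` prefix assumes that member names inside the archive mirror the directory layout produced by a full extraction.\n",
    "\n",
    "```python\n",
    "import tarfile\n",
    "\n",
    "\n",
    "def extract_subset(archive_path, prefix, limit, dest='.'):\n",
    "    # Extract at most `limit` regular files whose member name starts with `prefix`.\n",
    "    # Returns the number of files actually extracted.\n",
    "    extracted = 0\n",
    "    with tarfile.open(archive_path) as tar:\n",
    "        # Iterating over the archive reads members one at a time, so we can\n",
    "        # stop early without listing or unpacking the remainder.\n",
    "        for member in tar:\n",
    "            if extracted >= limit:\n",
    "                break\n",
    "            if member.isfile() and member.name.startswith(prefix):\n",
    "                tar.extract(member, path=dest)\n",
    "                extracted += 1\n",
    "    return extracted\n",
    "```\n",
    "\n",
    "For example, `extract_subset('2022-02-07/document_parses.tar.gz', 'document_parses/pdf_json/', 10_000)` would keep only the first 10,000 matching files; the limit here is an arbitrary choice."
   ]
  },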
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This does require **~59GB** of disk space. If you cannot extract the full amount due to space restrictions, it's completely fine to use a significantly smaller set of documents.\n",
    "\n",
    "After extraction we should find a new directory *document_parses* that contains two file directories:\n",
    "\n",
    "* *pdf_json* (34.3GB)\n",
    "* *pmc_json* (24.6GB)\n",
    "\n",
    "Both are large enough for our use-case, so we will stick with just *pdf_json*. Let's load a file from here to understand the format being used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "338451\n",
      "document_parses\\pdf_json\\0000028b5cc154f68b8a269f6578f21e31f62977.json\n"
     ]
    }
   ],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "paths = [\n",
    "    str(path) for path in Path('document_parses/pdf_json').glob('*.json')\n",
    "]\n",
    "print(str(len(paths))+\"\\n\"+paths[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have 338,450 PDF files in JSON format. We will take the first file to understand their structure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "with open(paths[0], 'r') as fp:\n",
    "    data = json.load(fp)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'paper_id': '0000028b5cc154f68b8a269f6578f21e31f62977',\n",
       " 'metadata': {'title': '\"Multi-faceted\" COVID-19: Russian experience',\n",
       "  'authors': []},\n",
       " 'abstract': [],\n",
       " 'body_text': [{'text': 'According to current live statistics at the time of editing this letter, Russia has been the third country in the world to be affected by COVID-19 with both new cases and death rates rising. It remains in a position of advantage due to the later onset of the viral spread within the country since the worldwide disease outbreak.',\n",
       "   'cite_spans': [],\n",
       "   'ref_spans': [],\n",
       "   'section': 'Editor'},\n",
       "  {'text': 'The first step in \"fighting\" the epidemic was nationwide lock down on March 30 th , 2020.',\n",
       "   'cite_spans': [],\n",
       "   'ref_spans': [],\n",
       "   'section': 'Editor'},\n",
       "  {'text': 'Most of the multidisciplinary hospitals have been repurposed as dedicated COVID-19 centres, so the surgeons started working as infectious disease specialists. Such a reallocation of health care capacity results in the effective management of this epidemiological problem 1 . The staff has undergone on-line 36-hour training course to become qualified in coronavirus infection treatment.',\n",
       "   'cite_spans': [],\n",
       "   'ref_spans': [],\n",
       "   'section': 'Editor'},\n",
       "  {'text': 'The surgeons of COVID-19 dedicated hospitals do rarely practice surgery. When ICU patients need mechanical ventilation, percutaneous tracheostomy under endoscopic control is mostly performed, as it decreases the aerosol formation, viral load on staff and complications, associated with an endotracheal tube in comparison with surgical tracheostomy 2 . However, it is still associated with the risk of aerosol formation, so different approaches should be considered for a long-time perspective 3 .',\n",
       "   'cite_spans': [{'start': 493,\n",
       "     'end': 494,\n",
       "     'text': '3',\n",
       "     'ref_id': 'BIBREF0'}],\n",
       "   'ref_spans': [],\n",
       "   'section': 'Editor'},\n",
       "  {'text': 'The majority of the studies dedicated to colorectal diseases are temporarily paused. The teaching and training are mostly translated via online platforms, which has excluded the opportunity to get clinical experience in surgery 4 .',\n",
       "   'cite_spans': [{'start': 228,\n",
       "     'end': 229,\n",
       "     'text': '4',\n",
       "     'ref_id': 'BIBREF1'}],\n",
       "   'ref_spans': [],\n",
       "   'section': 'Editor'},\n",
       "  {'text': 'The approach to patient routing has changed significantly. If one is not diagnosed with COVID-19 CT scan and laboratory testing are provided immediately. The patients should be admitted to the surgical department, where treatment is provided only to those COVID-19 negative.',\n",
       "   'cite_spans': [],\n",
       "   'ref_spans': [],\n",
       "   'section': 'Editor'},\n",
       "  {'text': 'The patient isolated for more than 2 weeks and COVID-19 negative as a result of 2 subsequent tests is admitted to the surgical department with an option to readmission to the infectious department and can be treated by surgical staff, which does not work with COVID-19 positive patients.',\n",
       "   'cite_spans': [],\n",
       "   'ref_spans': [],\n",
       "   'section': 'Editor'},\n",
       "  {'text': 'The patient, diagnosed with coronavirus infection and treated at home is admitted to COVID-19 dedicated multidisciplinary hospital, where surgical care is provided. Those treated in infectious diseases hospital or COVID-19 dedicated centre are managed by the surgical team present.',\n",
       "   'cite_spans': [],\n",
       "   'ref_spans': [],\n",
       "   'section': 'Editor'},\n",
       "  {'text': 'Surgery has become highly elective, being mostly available for high-risk patients with emergencies, malignancies, cardiovascular pathologies or infections. Preoperative testing in surgical patients with respiratory symptoms and history of travelling or contacting with COVID-19 positive people and postoperative recovery in the operating unit seem to be highly effective measures 5 . A lot of rearrangements are performed locally regarding personal protective equipment, the organization of scrubbing, donning and doffing, and dedicated changing areas. Moreover, observational departments are organized in surgical hospitals for patient allocation before coronavirus infection status is defined 6 .',\n",
       "   'cite_spans': [{'start': 380, 'end': 381, 'text': '5', 'ref_id': 'BIBREF2'},\n",
       "    {'start': 695, 'end': 696, 'text': '6', 'ref_id': 'BIBREF3'}],\n",
       "   'ref_spans': [],\n",
       "   'section': 'Editor'},\n",
       "  {'text': 'Surgery for benign disorders, precancerous lesions, and reconstructive procedures are currently postponed. Regarding colorectal cancer, surgical treatment may be postponed, if it is a non-obstructing disease 7 .',\n",
       "   'cite_spans': [{'start': 208,\n",
       "     'end': 209,\n",
       "     'text': '7',\n",
       "     'ref_id': 'BIBREF4'}],\n",
       "   'ref_spans': [],\n",
       "   'section': 'Editor'},\n",
       "  {'text': 'Laparoscopic surgery and diathermy are limited as well. The importance of special operating theatre for COVID-19 patients with negative pressure ventilation, unidirectional laminar flow, as well as the use of smoke evacuation systems during surgery are taken into account 8 .',\n",
       "   'cite_spans': [{'start': 272,\n",
       "     'end': 273,\n",
       "     'text': '8',\n",
       "     'ref_id': 'BIBREF5'}],\n",
       "   'ref_spans': [],\n",
       "   'section': 'Editor'},\n",
       "  {'text': 'Such an electiveness of surgery is concerning, as it might cause a worldwide healthcare catastrophe in the post-pandemic era 5 . More efforts should be taken to expand the amount and types of surgical procedures performed.',\n",
       "   'cite_spans': [{'start': 125,\n",
       "     'end': 126,\n",
       "     'text': '5',\n",
       "     'ref_id': 'BIBREF2'}],\n",
       "   'ref_spans': [],\n",
       "   'section': 'Editor'},\n",
       "  {'text': 'Due to the early preventive and corrective actions we have already reached the plateau in new cases curve, counting for up to 8 984 cases identified at the time of writing this paper (June 7 th , 2020), with a mortality rate of 1⋅5075%. These statistical outcomes are resulted by a 68-day lockdown, admission regime, and healthcare rearrangement. Thus, the multistep restriction lifting has already started to consistently recover in both social and economic aspects. ',\n",
       "   'cite_spans': [],\n",
       "   'ref_spans': [],\n",
       "   'section': 'Editor'}],\n",
       " 'bib_entries': {'BIBREF0': {'ref_id': 'b0',\n",
       "   'title': 'A hybrid approach to tracheostomy in COVID-19 patients ensuring staff safety',\n",
       "   'authors': [{'first': 'L', 'middle': [], 'last': 'Tanaka', 'suffix': ''},\n",
       "    {'first': 'M', 'middle': [], 'last': 'Alexandru', 'suffix': ''},\n",
       "    {'first': 'S', 'middle': [], 'last': 'Jbyeh', 'suffix': ''},\n",
       "    {'first': 'C', 'middle': [], 'last': 'Desbrosses', 'suffix': ''},\n",
       "    {'first': 'Z', 'middle': [], 'last': 'Bouzit', 'suffix': ''},\n",
       "    {'first': 'G', 'middle': [], 'last': 'Cheisson', 'suffix': ''}],\n",
       "   'year': 2020,\n",
       "   'venue': 'Br J Surg',\n",
       "   'volume': '102',\n",
       "   'issn': '',\n",
       "   'pages': '253--254',\n",
       "   'other_ids': {}},\n",
       "  'BIBREF1': {'ref_id': 'b1',\n",
       "   'title': 'Medical education: COVID-19 and surgery',\n",
       "   'authors': [{'first': 'S', 'middle': [], 'last': 'Khan', 'suffix': ''},\n",
       "    {'first': 'A', 'middle': [], 'last': 'Mian', 'suffix': ''}],\n",
       "   'year': 2020,\n",
       "   'venue': 'Br J Surg',\n",
       "   'volume': '107',\n",
       "   'issn': '',\n",
       "   'pages': '',\n",
       "   'other_ids': {}},\n",
       "  'BIBREF2': {'ref_id': 'b2',\n",
       "   'title': 'Elective surgeries during the COVID-19 outbreak',\n",
       "   'authors': [{'first': 'J', 'middle': [], 'last': 'Lee', 'suffix': ''},\n",
       "    {'first': 'J', 'middle': ['Y'], 'last': 'Choi', 'suffix': ''},\n",
       "    {'first': 'M', 'middle': ['S'], 'last': 'Kim', 'suffix': ''}],\n",
       "   'year': 2020,\n",
       "   'venue': 'Br J Surg',\n",
       "   'volume': '107',\n",
       "   'issn': '',\n",
       "   'pages': '',\n",
       "   'other_ids': {}},\n",
       "  'BIBREF3': {'ref_id': 'b3',\n",
       "   'title': 'Global guidance for surgical care during the COVID-19 pandemic',\n",
       "   'authors': [{'first': 'Covidsurg',\n",
       "     'middle': [],\n",
       "     'last': 'Collaborative',\n",
       "     'suffix': ''}],\n",
       "   'year': 2020,\n",
       "   'venue': 'Br J Surg',\n",
       "   'volume': '107',\n",
       "   'issn': '',\n",
       "   'pages': '1097--1103',\n",
       "   'other_ids': {}},\n",
       "  'BIBREF4': {'ref_id': 'b4',\n",
       "   'title': 'Colorectal cancer services during the COVID-19 pandemic',\n",
       "   'authors': [{'first': 'A', 'middle': [], 'last': 'Courtney', 'suffix': ''},\n",
       "    {'first': 'A', 'middle': ['M'], 'last': 'Howell', 'suffix': ''},\n",
       "    {'first': 'N', 'middle': [], 'last': 'Daulatzai', 'suffix': ''},\n",
       "    {'first': 'N', 'middle': [], 'last': 'Savva', 'suffix': ''},\n",
       "    {'first': 'O', 'middle': [], 'last': 'Warren', 'suffix': ''},\n",
       "    {'first': 'S', 'middle': [], 'last': 'Mills', 'suffix': ''}],\n",
       "   'year': 2020,\n",
       "   'venue': 'Br J Surg',\n",
       "   'volume': '',\n",
       "   'issn': '',\n",
       "   'pages': '',\n",
       "   'other_ids': {'DOI': ['10.1002/bjs.11706']}},\n",
       "  'BIBREF5': {'ref_id': 'b5',\n",
       "   'title': 'Safe management of surgical smoke in the age of COVID-19',\n",
       "   'authors': [{'first': 'N',\n",
       "     'middle': ['G'],\n",
       "     'last': 'Mowbray',\n",
       "     'suffix': ''},\n",
       "    {'first': 'Ansell', 'middle': ['J'], 'last': 'Horwood', 'suffix': ''},\n",
       "    {'first': 'J', 'middle': [], 'last': 'Cornish', 'suffix': ''},\n",
       "    {'first': 'J', 'middle': [], 'last': 'Rizkallah', 'suffix': ''},\n",
       "    {'first': 'P', 'middle': [], 'last': 'Parker', 'suffix': ''},\n",
       "    {'first': 'A', 'middle': [], 'last': '', 'suffix': ''}],\n",
       "   'year': 2020,\n",
       "   'venue': 'Br J Surg',\n",
       "   'volume': '',\n",
       "   'issn': '',\n",
       "   'pages': '',\n",
       "   'other_ids': {}}},\n",
       " 'ref_entries': {'TABREF0': {'text': 'Petr V. Tsarkov, Albina A. Zubayraeva, Yulia S. Medkova and Sergey K. Efetov I.M. Sechenov First Moscow State Medical University (Sechenov University), Clinic of Coloproctology and Minimally Invasive Surgery, Moscow, Russia DOI: 10.1002/bjs.11940 1 Her M. Repurposing and reshaping of hospitals during the COVID-19 outbreak in South Korea. One Health 2020; https://doi.org/10.1016/j.onehlt .2020.100137 [Epub ahead of print]. 2 McGrath BA, Brenner MJ, Warrillow SJ, Pandian V, Arora A, Cameron TS et al. Tracheostomy in the COVID-19 era: global and multidisciplinary guidance. Lancet Respir Med 2020; https://doi.org/10.1016/S2213-2600(20)30230-7 [Epub ahead of print].',\n",
       "   'latex': None,\n",
       "   'type': 'table'}},\n",
       " 'back_matter': []}"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dict_keys(['paper_id', 'metadata', 'abstract', 'body_text', 'bib_entries', 'ref_entries', 'back_matter'])"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.keys()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can read the `'body_text'`..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'text': 'According to current live statistics at the time of editing this letter, Russia has been the third country in the world to be affected by COVID-19 with both new cases and death rates rising. It remains in a position of advantage due to the later onset of the viral spread within the country since the worldwide disease outbreak.',\n",
       "  'cite_spans': [],\n",
       "  'ref_spans': [],\n",
       "  'section': 'Editor'},\n",
       " {'text': 'The first step in \"fighting\" the epidemic was nationwide lock down on March 30 th , 2020.',\n",
       "  'cite_spans': [],\n",
       "  'ref_spans': [],\n",
       "  'section': 'Editor'},\n",
       " {'text': 'Most of the multidisciplinary hospitals have been repurposed as dedicated COVID-19 centres, so the surgeons started working as infectious disease specialists. Such a reallocation of health care capacity results in the effective management of this epidemiological problem 1 . The staff has undergone on-line 36-hour training course to become qualified in coronavirus infection treatment.',\n",
       "  'cite_spans': [],\n",
       "  'ref_spans': [],\n",
       "  'section': 'Editor'},\n",
       " {'text': 'The surgeons of COVID-19 dedicated hospitals do rarely practice surgery. When ICU patients need mechanical ventilation, percutaneous tracheostomy under endoscopic control is mostly performed, as it decreases the aerosol formation, viral load on staff and complications, associated with an endotracheal tube in comparison with surgical tracheostomy 2 . However, it is still associated with the risk of aerosol formation, so different approaches should be considered for a long-time perspective 3 .',\n",
       "  'cite_spans': [{'start': 493, 'end': 494, 'text': '3', 'ref_id': 'BIBREF0'}],\n",
       "  'ref_spans': [],\n",
       "  'section': 'Editor'},\n",
       " {'text': 'The majority of the studies dedicated to colorectal diseases are temporarily paused. The teaching and training are mostly translated via online platforms, which has excluded the opportunity to get clinical experience in surgery 4 .',\n",
       "  'cite_spans': [{'start': 228, 'end': 229, 'text': '4', 'ref_id': 'BIBREF1'}],\n",
       "  'ref_spans': [],\n",
       "  'section': 'Editor'},\n",
       " {'text': 'The approach to patient routing has changed significantly. If one is not diagnosed with COVID-19 CT scan and laboratory testing are provided immediately. The patients should be admitted to the surgical department, where treatment is provided only to those COVID-19 negative.',\n",
       "  'cite_spans': [],\n",
       "  'ref_spans': [],\n",
       "  'section': 'Editor'},\n",
       " {'text': 'The patient isolated for more than 2 weeks and COVID-19 negative as a result of 2 subsequent tests is admitted to the surgical department with an option to readmission to the infectious department and can be treated by surgical staff, which does not work with COVID-19 positive patients.',\n",
       "  'cite_spans': [],\n",
       "  'ref_spans': [],\n",
       "  'section': 'Editor'},\n",
       " {'text': 'The patient, diagnosed with coronavirus infection and treated at home is admitted to COVID-19 dedicated multidisciplinary hospital, where surgical care is provided. Those treated in infectious diseases hospital or COVID-19 dedicated centre are managed by the surgical team present.',\n",
       "  'cite_spans': [],\n",
       "  'ref_spans': [],\n",
       "  'section': 'Editor'},\n",
       " {'text': 'Surgery has become highly elective, being mostly available for high-risk patients with emergencies, malignancies, cardiovascular pathologies or infections. Preoperative testing in surgical patients with respiratory symptoms and history of travelling or contacting with COVID-19 positive people and postoperative recovery in the operating unit seem to be highly effective measures 5 . A lot of rearrangements are performed locally regarding personal protective equipment, the organization of scrubbing, donning and doffing, and dedicated changing areas. Moreover, observational departments are organized in surgical hospitals for patient allocation before coronavirus infection status is defined 6 .',\n",
       "  'cite_spans': [{'start': 380, 'end': 381, 'text': '5', 'ref_id': 'BIBREF2'},\n",
       "   {'start': 695, 'end': 696, 'text': '6', 'ref_id': 'BIBREF3'}],\n",
       "  'ref_spans': [],\n",
       "  'section': 'Editor'},\n",
       " {'text': 'Surgery for benign disorders, precancerous lesions, and reconstructive procedures are currently postponed. Regarding colorectal cancer, surgical treatment may be postponed, if it is a non-obstructing disease 7 .',\n",
       "  'cite_spans': [{'start': 208, 'end': 209, 'text': '7', 'ref_id': 'BIBREF4'}],\n",
       "  'ref_spans': [],\n",
       "  'section': 'Editor'},\n",
       " {'text': 'Laparoscopic surgery and diathermy are limited as well. The importance of special operating theatre for COVID-19 patients with negative pressure ventilation, unidirectional laminar flow, as well as the use of smoke evacuation systems during surgery are taken into account 8 .',\n",
       "  'cite_spans': [{'start': 272, 'end': 273, 'text': '8', 'ref_id': 'BIBREF5'}],\n",
       "  'ref_spans': [],\n",
       "  'section': 'Editor'},\n",
       " {'text': 'Such an electiveness of surgery is concerning, as it might cause a worldwide healthcare catastrophe in the post-pandemic era 5 . More efforts should be taken to expand the amount and types of surgical procedures performed.',\n",
       "  'cite_spans': [{'start': 125, 'end': 126, 'text': '5', 'ref_id': 'BIBREF2'}],\n",
       "  'ref_spans': [],\n",
       "  'section': 'Editor'},\n",
       " {'text': 'Due to the early preventive and corrective actions we have already reached the plateau in new cases curve, counting for up to 8 984 cases identified at the time of writing this paper (June 7 th , 2020), with a mortality rate of 1⋅5075%. These statistical outcomes are resulted by a 68-day lockdown, admission regime, and healthcare rearrangement. Thus, the multistep restriction lifting has already started to consistently recover in both social and economic aspects. ',\n",
       "  'cite_spans': [],\n",
       "  'ref_spans': [],\n",
       "  'section': 'Editor'}]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data['body_text']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "13"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(data['body_text'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have a lot of text data here, *60* of these larger passages, let's try extracting them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['According to current live statistics at the time of editing this letter, Russia has been the third country in the world to be affected by COVID-19 with both new cases and death rates rising. It remains in a position of advantage due to the later onset of the viral spread within the country since the worldwide disease outbreak.',\n",
       " 'The first step in \"fighting\" the epidemic was nationwide lock down on March 30 th , 2020.',\n",
       " 'Most of the multidisciplinary hospitals have been repurposed as dedicated COVID-19 centres, so the surgeons started working as infectious disease specialists. Such a reallocation of health care capacity results in the effective management of this epidemiological problem 1 . The staff has undergone on-line 36-hour training course to become qualified in coronavirus infection treatment.']"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "body_text = [\n",
    "    line['text'] for line in data['body_text']\n",
    "]\n",
    "body_text[:3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's perfect, we now have a very large repository of COVID related passages of text. We will use this logic when loading the dataset in our use-cases. All directories other than *document_parses* can be removed."
   ]
  }
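  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The loading logic above can be wrapped up for reuse across the whole directory. The following is a minimal sketch under stated assumptions: `iter_passages` and its `max_files` parameter are hypothetical names, not part of the original notebook, and every file is assumed to follow the structure shown for the first document.\n",
    "\n",
    "```python\n",
    "import json\n",
    "from pathlib import Path\n",
    "\n",
    "\n",
    "def iter_passages(directory, max_files=None):\n",
    "    # Yield (paper_id, passage_text) pairs from the JSON parses in `directory`.\n",
    "    # `max_files` optionally caps how many files are read (None reads all).\n",
    "    paths = sorted(Path(directory).glob('*.json'))\n",
    "    for path in paths[:max_files]:\n",
    "        with open(path, 'r', encoding='utf-8') as fp:\n",
    "            data = json.load(fp)\n",
    "        # A missing 'body_text' key is treated as a paper with no passages\n",
    "        for line in data.get('body_text', []):\n",
    "            yield data.get('paper_id'), line['text']\n",
    "```\n",
    "\n",
    "Because this is a generator, passages are read one file at a time, which avoids holding the full set of parsed documents in memory."
   ]
  }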
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.7 (main, Sep 14 2022, 22:38:23) [Clang 14.0.0 (clang-1400.0.29.102)]"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
